Measuring and Improving the Quality of World Knowledge Extracted From WordNet
Abstract
WordNet is a lexical database that, among other things, arranges English nouns into a hierarchy ranked by specificity, providing links between a more general word and words that are specializations of it. For example, the word “mammal” is linked (transitively via some intervening words) to “dog” and to “cat.” This hierarchy bears some resemblance to the hierarchies of types (or properties, or predicates) often used in artificial intelligence systems. However, WordNet was not designed for such uses, and is organized in a way that makes it far from ideal for them. This report describes our attempts to arrive at a quantitative measure of the quality of the information that can be extracted from WordNet by interpreting it as a formal taxonomy, and to design automatic techniques for improving the quality by filtering out dubious assertions.
This work was funded by NSF research grant IIS-0082928. The authors would like to thank David Ahn, Greg Carlson, and Henry Kyburg for regular discussions during the course of this work, and David Ahn for help labeling data.
WordNet [4] was designed as a dictionary and thesaurus for human use. It differs from classical dictionaries and thesauri in that rather than being simply an alphabetized list of self-contained entries, it is richly cross-referenced with relationships that psycholinguistic research indicates are involved in the organization of the human mental lexicon. Of these relationships, the one with the most instances in WordNet is the hyponymy relationship, which links a more specific word (hyponym) like “cat” to a more general one (hypernym) like “mammal.” For a variety of reasons, researchers in artificial intelligence and natural language processing have found taxonomic concept hierarchies to be useful components of their computational systems.
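The transitive character of the hyponymy links described above can be illustrated with a minimal sketch. The hierarchy below is a hand-made toy, not data copied from WordNet, and the names `HYPERNYMS`, `hypernym_closure`, and `is_hyponym_of` are hypothetical helpers introduced only for this illustration.

```python
# Toy hypernym links (assumption: illustrative entries, not WordNet's own).
# Each key is a more specific word; each value lists its direct hypernyms.
HYPERNYMS = {
    "dog": ["canine"],
    "canine": ["carnivore"],
    "cat": ["feline"],
    "feline": ["carnivore"],
    "carnivore": ["placental"],
    "placental": ["mammal"],
    "mammal": ["vertebrate"],
}


def hypernym_closure(word, links=HYPERNYMS):
    """Return every word reachable from `word` by following hypernym links."""
    seen = set()
    stack = [word]
    while stack:
        current = stack.pop()
        for parent in links.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def is_hyponym_of(specific, general, links=HYPERNYMS):
    """True if `general` is reachable from `specific` via one or more links."""
    return general in hypernym_closure(specific, links)
```

With this toy data, `is_hyponym_of("dog", "mammal")` holds only through the intervening words, which mirrors how “mammal” is linked to “dog” and “cat” in the hierarchy.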
The various taxonomies used in AI, while they differ in content, in general have a standard form: they are trees or lattices, in which the nodes represent predicates or properties, and the links indicate subsumption relationships. It is tempting to interpret WordNet’s noun hierarchy in this way, since it is a lattice structure whose nodes are sets of synonymous nouns (nouns in natural language are often taken to express predicates or properties), and whose links indicate a relationship expressed in English as “is a” or “is a kind of,” which seems similar in flavor to the subsumption relationship. Because it has this familiar-looking structure, because it has such broad coverage (some 94,000 nouns expressing some 66,000 unique concepts), and because it is hand-constructed and therefore presumably of high quality compared to taxonomies collected by automatic clustering techniques (e.g. [9, 3]), it is tempting to try to use WordNet as a source of taxonomic knowledge for a logical reasoning system.
The creators of WordNet did not have in mind a precise interpretation of the kind typically used in the knowledge representation field, and consequently interpreting it in such a precise way sometimes yields incorrect information. In Section 2, we will list a number of factors that make a formal interpretation of WordNet problematic, but in order to introduce the approach to be taken in this paper, let us consider one example. In WordNet, the words ‘gold’ and ‘noble metal’ are linked by the same relationship that links ‘noble metal’ with ‘metallic element,’ namely the hyponymy relationship. According to the consensus in semantics, in the sentence “Gold is a noble metal,” the word ‘gold’ names an individual, the phrase ‘noble metal’ names a set (or a property), and the sentence is an assertion that the individual is a member of the set (or that the individual instantiates the property).
In the sentence “A noble metal is a metallic element,” ‘noble metal’ and ‘metallic element’ both name sets (or properties), and the sentence is an assertion that the first set is a subset of the second (or that instances of the first property are also instances of the second). A knowledge representation useful for logical inference must differentiate between the subset and membership relations, because they have different entailments.
Nevertheless, since WordNet contains such a wealth of information, it may be useful to use it as if it conformed to some more precise interpretation, as long as the performance of the system using the information degrades gracefully when given occasional incorrect information. A primary goal of the work described in this paper was to measure the quality of the information that can be obtained from WordNet in this way. The intention was to define a precise interpretation of the sort usually used in computational knowledge representations, and to measure by statistical sampling the proportion of assertions in WordNet that are false under that strict interpretation. This proportion could then be used as the degree of confidence that a system places in information extracted from WordNet.
As a secondary goal, we hoped to find ways of improving this proportion by automatically identifying assertions that are likely to be false under the imposed interpretation. Work towards the first goal could feed into the second: in the process of sampling and evaluating the truth of assertions from WordNet, we would begin to understand the ways in which WordNet tends to deviate from the interpretation we had imposed, and we would also be creating a corpus of labeled instances that might be useful as training data for a classifier that could automatically identify assertions likely to be false under the imposed interpretation.
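The sampling-based measurement described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure the authors actually used: the function names `sample_assertions` and `estimate_error_rate` are hypothetical, and the Wilson score interval is one common choice for putting bounds on a sampled proportion, swapped in here because the text does not say how uncertainty was quantified.

```python
import math
import random


def sample_assertions(assertions, k, seed=0):
    """Draw k assertions uniformly at random, without replacement."""
    rng = random.Random(seed)
    return rng.sample(assertions, k)


def estimate_error_rate(labels, z=1.96):
    """Estimate the proportion of false assertions from hand-labeled samples.

    `labels` is a list of booleans; True means the sampled assertion was
    judged false under the strict interpretation. Returns the sample
    proportion and a Wilson score interval (z=1.96 is roughly 95%).
    The interval method is an assumption of this sketch.
    """
    n = len(labels)
    if n == 0:
        raise ValueError("need at least one labeled assertion")
    p = sum(labels) / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return p, (max(0.0, centre - half), min(1.0, centre + half))
```

A system consuming the extracted assertions could take one minus the estimated proportion as its degree of confidence, and the labeled sample itself could be kept as training data for a classifier of dubious assertions.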
The first step in this proposed course of research, therefore, would be to fix a precise interpretation that could be imposed on WordNet. We originally assumed that this would be trivial, but it turned out to be one of the primary problems, and one which we still have not solved. Imposing a precise interpretation on WordNet includes two subproblems. The first, which is easily accomplished, is to define a semantics for the representation. This means specifying what sort of object a node
Publication date: 2001